One of the objectives of our paper is to evaluate whether the literature on the short-term health effects of air pollution suffers from power and bias issues.
In this particular document, we focus mainly on the epidemiology literature. We take advantage of a somewhat standardized reporting format to retrieve estimates and confidence intervals from abstracts. We then run robustness checks to compute the power and the type M and type S errors of the estimates in the studied articles: we look at what the power, type M and type S errors would be if the true effect were a fraction of the measured effect.
We retrieved estimates and confidence intervals of articles in the literature of interest in another document. Before looking into the power analysis itself, we look at the characteristics of the articles considered.
We retrieved the articles using the following query:
‘TITLE((“air pollution” OR “air quality” OR “particulate matter” OR ozone OR “nitrogen dioxide” OR “sulfur dioxide” OR “PM10” OR “PM2.5” OR “carbon dioxide” OR “carbon monoxide”) AND (“emergency” OR “mortality” OR “stroke” OR “cerebrovascular” OR “cardiovascular” OR “death” OR “hospitalization”) AND NOT (“long term” OR “long-term”)) AND “short term”’
This query returns 1624 articles. Based on the abstracts, we can briefly explore the main (unsurprising) themes of the articles:
Not all abstracts display effects and confidence intervals. We therefore want to assess whether there are noticeable differences between articles for which we retrieve confidence intervals and those for which we do not. This quick exploration will also provide additional information and descriptive statistics on the whole set of articles.
Out of all abstracts returned by the query, 700 display confidence intervals:
In these articles, we retrieve valid effects and confidence intervals in the following proportions (see note 1):
| Effect retrieved | Number of articles | Proportion |
|---|---|---|
| Yes | 599 | 0.856 |
| No | 101 | 0.144 |
This corresponds to 1880 valid effects and associated confidence intervals.
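The extraction itself is described in another document. As a rough illustration only, the sketch below shows how an estimate and its 95% confidence interval could be pulled from an abstract with a regular expression. The pattern, the function name `extract_effects` and the validity rule (the estimate must lie inside its interval) are assumptions made for this example, not the method actually used.

```python
import re

# Hypothetical pattern for strings such as "1.25% (95% CI: 0.50, 2.00)".
# Real abstracts use many more variants than this sketch covers.
NUMBER = r"-?\d+(?:\.\d+)?"
EFFECT_CI = re.compile(
    rf"({NUMBER})\s*%?\s*\(\s*95\s*%\s*CI\s*[:,=]?\s*"
    rf"({NUMBER})\s*%?\s*(?:,|-|–|to)\s*({NUMBER})\s*%?\s*\)",
    flags=re.IGNORECASE,
)


def extract_effects(abstract):
    """Return (estimate, lower, upper) tuples found in an abstract."""
    found = []
    for match in EFFECT_CI.finditer(abstract):
        estimate, lower, upper = (float(g) for g in match.groups())
        # Assumed validity rule: keep only estimates inside their interval.
        if lower <= estimate <= upper:
            found.append((estimate, lower, upper))
    return found


# Made-up sentence, not taken from the corpus.
example = (
    "A 10 µg/m3 increase in PM2.5 was associated with a "
    "1.25% (95% CI: 0.50, 2.00) increase in mortality."
)
effects = extract_effects(example)
```

An abstract that mentions “95% CI” without reporting numbers yields an empty list under this pattern, which is the situation described in note 1.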
Here is a random example of the effects and confidence intervals detected by our method (highlighted in gray):
In this subsection, we investigate whether there are systematic differences between articles displaying an effect that we detected in the abstract and articles that do not display an effect or for which we did not detect the effect.
We first wonder whether there are disparities in publication dates. It might be the case that displaying effects in the abstract was a feature of a given period.
Even though there are slightly more recent (2010-2020) articles for which effects are retrieved, the difference does not seem to be substantial. The first article for which an effect is detected was published in 1992. Few articles were published on this topic before this date (we only find 30) because, in most places, air pollution has only been measured since the 1990s.
We then investigate whether there are differences in the journals in which the articles are published.
For this analysis to be informative, we would need to cluster the journals into groups (e.g., epidemiology journals, general science journals, etc.).
Then, we examine whether the words used in each type of abstract differ.
Apart from a few key terms, such as “CI” and “95”, there are no large differences in the terms used in the two types of abstracts.
When there are enough articles, our propensity to detect an effect does not seem to vary much with the type of pollutant. Note that if an article considers several pollutants, it appears several times in this graph.
Now that we have quickly compared the articles for which we retrieve an effect and those for which we do not, we can dig further into the analysis of the estimates retrieved.
In this section, we briefly analyse the effects retrieved. First, we look into the proportion of effects which are significant.
Unsurprisingly, most of the effects retrieved here are significant: these are effects that are reported in the abstracts, along with confidence intervals.
We then look into the distribution of the t-scores.
There seems to be some bunching of t-scores above 1.96. In this analysis, we only consider estimates reported in the abstracts. Authors may report only significant estimates in their abstracts even though they also report non-significant estimates in the body of the article. This might explain the bunching. We need to investigate further to understand whether this bunching is evidence of publication bias, for instance by reproducing the present analysis on the full texts rather than on the abstracts only.
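For reference, a t-score can be recovered from a reported estimate and its confidence interval as sketched below. This assumes a symmetric 95% interval built from a normal approximation, so that the interval spans about 2 × 1.96 standard errors. The function names and the numbers are illustrative only.

```python
from statistics import NormalDist

# Critical value of a two-sided 95% interval (about 1.96).
Z_95 = NormalDist().inv_cdf(0.975)


def standard_error(lower, upper):
    """Standard error implied by a symmetric 95% confidence interval."""
    return (upper - lower) / (2 * Z_95)


def t_score(estimate, lower, upper):
    """Ratio of the point estimate to its implied standard error."""
    return estimate / standard_error(lower, upper)


# Made-up estimate: 1.25 with 95% CI (0.50, 2.00).
t = t_score(1.25, 0.50, 2.00)
significant = abs(t) > Z_95
```

Under these assumptions, an interval that excludes zero always yields a t-score larger than 1.96 in absolute value.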
We then plot the distribution of the signal-to-noise ratio, i.e., the ratio of the point estimate to the width of the confidence interval.
The graph is of course analogous to the previous one. It nevertheless shows that, in a large share of the studies, the magnitude of the noise is larger than the magnitude of the effect. Looking in more detail at the distribution of the signal-to-noise ratio, we notice that for close to 40% of the estimates considered here, the magnitude of the noise is larger than that of the signal.
| Signal to noise ratio | Percentage with a lower signal to noise ratio |
|---|---|
| 0.0322581 | 0% |
| 0.5384542 | 10% |
| 0.6564957 | 20% |
| 0.8215482 | 30% |
| 1.0241936 | 40% |
| 1.3454994 | 50% |
| 2.2152047 | 60% |
| 4.6094737 | 70% |
| 9.8214508 | 80% |
| 23.8192248 | 90% |
| 834.8333333 | 100% |
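The signal-to-noise ratio and its deciles can be computed as in the sketch below, which uses made-up estimates and takes the absolute value of the point estimate (an assumption). If the interval is a symmetric 95% interval, this ratio equals the absolute t-score divided by about 3.92, which is why the two graphs are analogous.

```python
from statistics import quantiles


def signal_to_noise(estimate, lower, upper):
    """Absolute point estimate divided by the width of its confidence interval."""
    return abs(estimate) / (upper - lower)


# Hypothetical (estimate, lower bound, upper bound) triples.
estimates = [
    (1.25, 0.50, 2.00),
    (0.80, -0.20, 1.80),
    (3.00, 2.50, 3.50),
    (0.40, 0.05, 0.75),
]
ratios = [signal_to_noise(*e) for e in estimates]

# Share of estimates whose interval is wider than the estimate itself.
share_noisy = sum(r < 1 for r in ratios) / len(ratios)

# Nine cut points splitting the ratios into ten groups.
deciles = quantiles(ratios, n=10)
```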
We then turn to the power analysis itself. The objective is to evaluate the power, type M and type S errors for each estimate.
To compute these values, we would need to know the true effect size. Yet, in general, we do not know what the true effect is. It would be particularly challenging to retrieve exactly what is measured in each analysis since there is no standardized way of reporting the results. A study may for instance claim that a 10 \(\mu g/m^{3}\) increase in PM2.5 concentration leads to an increase of x% in hospital admissions over the course of a year while another study may state that a 2% increase in ozone concentration increases the number of deaths by 3 over a month. For each estimate retrieved, even though we do not know what is measured, we can evaluate the precision with which it is estimated.
To circumvent the fact that we do not know the actual effect size, we follow the strategy suggested by Gelman and Carlin (2014). We consider different potential “true” effect sizes and run robustness checks to investigate what the power, type M and type S errors would be if the true effect were only a fraction of the measured effect. The results are thus only indicative. There is no reason to think a priori that a given effect is overestimated. Yet if, assuming that the true effect is 3/4 of the measured effect, we find that significant estimates would be exaggerated by a factor of 2 on average, there might be a substantial issue with this estimate.
To do so, we use the package retrodesign, which performs post-analysis design calculations (power, type M and type S errors). We run the function retro_design() for several effect sizes.
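To make these quantities concrete, here is a minimal sketch of the calculation for a normally distributed estimator with known standard error. It is not the code of the retrodesign package: the function name `retro_design_sketch` is hypothetical, and the type M error is computed from a closed-form expression for the expected absolute value of significant estimates, whereas the package may rely on simulation. The hypothesized true effect must be nonzero.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def retro_design_sketch(true_effect, se, alpha=0.05):
    """Power, type S and type M errors for a hypothesized nonzero true effect."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    lam = abs(true_effect) / se  # true effect in standard-error units
    p_right = STD_NORMAL.cdf(lam - z)  # significant with the correct sign
    p_wrong = STD_NORMAL.cdf(-lam - z)  # significant with the wrong sign
    power = p_right + p_wrong
    type_s = p_wrong / power
    # E[|estimate| * 1{significant}] in standard-error units.
    expected_abs = (
        STD_NORMAL.pdf(lam + z)
        + STD_NORMAL.pdf(lam - z)
        + lam * (p_right - p_wrong)
    )
    type_m = expected_abs / (lam * power)  # exaggeration ratio
    return {"power": power, "type_s": type_s, "type_m": type_m}


# Fractions of the measured effect considered in the robustness checks.
FRACTIONS = [0.01, 0.05, 0.1, 0.33, 0.5, 0.67, 0.75, 0.9, 1]

# Made-up estimate and standard error.
estimate, se = 1.25, 0.38
results = {f: retro_design_sketch(f * estimate, se) for f in FRACTIONS}
```

In this sketch, power rises and the exaggeration ratio falls towards 1 as the hypothesized true effect grows relative to the standard error.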
In a first part, we carry out our analysis on the whole set of abstracts. We notice that there is some heterogeneity across articles, some displaying high power and others lower power. Thus, in a second part, we look in more detail at articles displaying low power.
We start by computing the average and median power, type M and type S errors for a set of “true” effects.
| Assumed true effect | Power (mean) | Power (median) | Type M (mean) | Type M (median) | Type S (mean) | Type S (median) |
|---|---|---|---|---|---|---|
| 1% of the measured effect | 0.1031186 | 0.0503187 | 56.366388 | 44.340189 | 0.3400676 | 0.4386578 |
| 5% of the measured effect | 0.2496683 | 0.0580046 | 11.426730 | 8.937214 | 0.1908713 | 0.2255850 |
| 10% of the measured effect | 0.3391992 | 0.0824306 | 5.873157 | 4.548757 | 0.1097782 | 0.0780555 |
| 33% of the measured effect | 0.5459089 | 0.4132677 | 2.104172 | 1.541507 | 0.0144482 | 0.0002604 |
| 50% of the measured effect | 0.6602571 | 0.7508618 | 1.585378 | 1.160191 | 0.0055088 | 0.0000029 |
| 67% of the measured effect | 0.7545065 | 0.9422312 | 1.349583 | 1.034729 | 0.0029794 | 0.0000000 |
| 75% of the measured effect | 0.7913018 | 0.9770161 | 1.281254 | 1.014091 | 0.0024007 | 0.0000000 |
| 90% of the measured effect | 0.8478207 | 0.9973378 | 1.193230 | 1.001735 | 0.0017218 | 0.0000000 |
| 100% of the measured effect | 0.8774069 | 0.9995402 | 1.153525 | 1.000312 | 0.0014286 | 0.0000000 |
Then, we explore graphically the distribution of power, type M and type S errors across estimates, for different sizes of the true effect.
A large share of articles display high power and low rates of type M and type S errors in each robustness check. However, a non-negligible number of articles display lower power and/or some evidence of type M error. Type S error does not seem to be an important issue in this literature. We investigate potential drivers of low power and type M errors further in the next subsection.
Note that for type M errors, due to some outliers, we used a log scale. We also plot the distribution without the log scale, restricting our sample to type M errors lower than 2.5 (95% of our sample, even when we assume that the true effect is only 1/3 of the estimated one).
We find that, even if the measured effect is the true effect, there is some risk of type M error.
The empirical cumulative distribution functions (ECDFs) also provide useful information on the distribution of power, type M and type S errors across studies.
We notice that about 50% of studies would be underpowered at the conventional 80% level if we considered that the true effect was half the measured effect.
Then, we look at how type M and type S errors evolve with power for the estimates considered.
There is a one-to-one relationship between power and type M and type S errors. Unsurprisingly, type M and type S errors skyrocket in studies with low power.
We then investigate how average power, type M and type S errors evolve with the true effect size, expressed as a proportion of the measured effect.
Power decreases, and type M and type S errors skyrocket, for small values of the true effect (as a proportion of the measured effect). In addition, if for each paper in the literature the true effect were 3/4 of the measured effect, the average power would be lower than the usual 80%. Type S error only seems to be an issue for small values of the true effect as a proportion of the measured effect. Type M error seems to be more consistently problematic. The sharp increase in the previous graph makes it difficult to read the values of type M error when the true effect is not a small proportion of the measured effect. We therefore zoom in.
We notice that, on average in the literature, the treatment effects are overestimated, even for large values of the true effect. This result might be driven by some outliers. We thus look at how the medians, rather than the means, evolve with the true effect size.
We notice that the issue is much less pronounced when looking at the median. This suggests some heterogeneity in terms of power in the literature.
To confirm this, we look at how the distribution of power evolves with the true effect size, expressed as a proportion of the measured effect.
The overall distribution of power seems almost bimodal: for most estimates, power is either very high or very low.
It might also be interesting to look at how power, type M and type S errors evolved over time, i.e., with publication date.
There does not seem to be a clear trend in the evolution of power and type S error. However, type M error seems to have peaked in the 2010s and to be decreasing again recently.
In the previous section, we noticed that a non-negligible number of studies seemed to suffer from low power and the associated type M error. We consider that an estimate has low power if its computed power is lower than 80% when the true effect is assumed to be 3/4 of the measured effect. 80% is the threshold usually used in power analyses, but 3/4 is arbitrary and could easily be changed in a robustness check. Following this criterion, the number and proportion of estimates with low power are as follows:
| Power | Number of estimates | Proportion |
|---|---|---|
| Adequate power | 1186 | 0.6308511 |
| Low power | 694 | 0.3691489 |
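The classification rule above can be sketched as follows, assuming a normally distributed estimator and a two-sided 5% test. The function names, the default arguments and the sample values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(true_effect, se, alpha=0.05):
    """Probability of a significant estimate given a hypothesized true effect."""
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    lam = abs(true_effect) / se
    return STD_NORMAL.cdf(lam - z) + STD_NORMAL.cdf(-lam - z)


def is_low_power(estimate, se, fraction=0.75, threshold=0.80):
    """Flag an estimate whose power falls below the threshold when the
    true effect is assumed to be `fraction` of the measured effect."""
    return power(fraction * estimate, se) < threshold


# Made-up (estimate, standard error) pairs.
sample = [(1.25, 0.38), (0.80, 0.51), (3.00, 0.26)]
flags = [is_low_power(e, s) for e, s in sample]
share_low = sum(flags) / len(flags)
```

Under these assumptions, the rule amounts to flagging estimates whose absolute t-score is below roughly 3.7, since 80% power requires the hypothesized effect to be about 2.8 standard errors away from zero.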
We investigate the particularities of the articles with low power. We start by reproducing the analyses used to compare articles for which we retrieved an effect and those for which we did not. First, we look into the distribution of publication dates.
It seems that fewer articles with low power have been published recently, in comparison to articles with adequate power. This confirms our previous finding. We then look into the distribution of articles across journals.
Interestingly, some journals, such as “Science of the Total Environment”, the “International Journal of Occupational Medicine and Environmental Health”, the “Cochrane Database of Systematic Reviews”, “Environmental Science and Pollution Research” and the “Journal of Exposure Science and Environmental Epidemiology”, publish a large share of low power studies. In contrast, BMJ Open publishes very few low power studies.
Here also, grouping the journals into broad themes could be more instructive.
There does not seem to be a clear trend in the proportion of articles with low power. If anything, it has slightly decreased in the last decade.
We also look into potential disparities in terms of pollutant.
There do not seem to be stark differences by pollutant type.
We then compare power by type of health outcome (mortality or hospital admissions).
There is no noticeable difference along this dimension.
Finally, we examine whether power depends on the length of the study period. Power probably depends on the number of observations, but this information is difficult to retrieve: in particular, we do not know how many cities are analysed in each study.
1. Note that a number of abstracts contain the string “CI” without actually displaying effects and confidence intervals.